Characterization of Shigella flexneri in northern Vietnam in 2012–2016

Introduction. Shigellosis remains a considerable public health concern in developing countries. Shigella flexneri and Shigella sonnei are prevalent worldwide and S. sonnei has been replacing S. flexneri . Gap Statement. S. flexneri still causes outbreaks of shigellosis in northern Vietnam but limited information is available on its genetic characteristics. Aim. This study aimed to characterize the genetic characteristics of S. flexneri strains from northern Vietnam. Methodology. This study used 17 isolates from eight incidents, collected in northern Vietnam between 2012 and 2016. The samples were subjected to whole genome sequencing, molecular serotyping, cluster analysis and identification of antimicrobial resistance genes. Additionally, phylogenetic analysis was performed including isolates from previous studies. Results. Clusters were identified according to spatiotemporal backgrounds. The results suggested that two incidents in Yen Bai province in 2015 and 2016 were derived from a very recent common ancestor. All isolates belonged to phylogroup (PG) 3, which was divided into two sub-lineages. Thirteen of 17 isolates, including those from the Yen Bai incidents, belonged to sub-lineage Sub-1 and were serotyped as 1a. The remaining four isolates belonged to sub-lineage Sub-2 and were the globally predominant serotype 2a. The Sub-1 S. flexneri isolates possessed the gtrI gene, which encodes the glycosyl transferase that determines serotype 1a, with bacteriophage elements in the vicinity. Conclusion. This study revealed two PG3 sub-lineages of S. flexneri in northern Vietnam, of which Sub-1 might be specific to the region.


INTRODUCTION
Shigella is a Gram-negative, facultatively anaerobic, non-spore-forming, and non-motile bacterial genus belonging to the family Enterobacteriaceae; it is the aetiological agent of bacillary dysentery or shigellosis. Shigella is a common cause of diarrhoea in developing countries, leading to an estimated 269 million episodes and 212 000 deaths annually across all ages worldwide [1]. Shigella comprises four species: Shigella dysenteriae, S. flexneri, S. boydii and S. sonnei. Approximately 93 % of shigellosis cases are attributed to S. flexneri and S. sonnei [2]. The ratio of the two species differs among countries; the proportion of S. flexneri is higher in developing countries [3]. S. flexneri can be subdivided into serotypes based on the O-antigen component of LPS. There are six serogroups, 1-6, and serogroups 1-5 are further subdivided into serotypes such as 2a [4].

OPEN ACCESS
Phylogenetically, S. flexneri has two distinct lineages, one containing only S. flexneri serotype 6 and the other containing all other serotypes. The latter consist of seven phylogroups (PGs) that were determined by single nucleotide polymorphisms [5]. PGs 1-3 are dominant, accounting for 81 %. Although there are dominant serotypes in some PGs (e.g. serotype 1b, 3a and 2a for PG1, PG2 and PG3, respectively), each PG includes multiple serotypes as PGs are not dependent on serotypes. S. flexneri serotype 2a of PG3 is predominant globally [5]. S. sonnei has been increasing in prevalence and is replacing S. flexneri, even in developing countries, including in Southeast Asia [3,6]. In Vietnam, this species replacement is significant in the southern region [7,8]. However, shigellosis outbreaks due to S. flexneri continue to occur, and information on the genomic characterization remains limited. This study aimed to analyse S. flexneri isolates from incidents (sporadic cases and outbreaks) in northern Vietnam using whole genome sequencing (WGS) to uncover the characteristics of S. flexneri in this region.

Bacterial isolates
Seventeen S. flexneri isolates collected in northern Vietnam between 2012 and 2016 were used in this study. All were isolated in investigations notified by the local health centres, whose information had been anonymized. Sporadic cases and outbreak-related cases were probably included but detailed information was not available. We therefore use the term 'incident' . A summary of the isolates is presented in Table 1. The reference sequence data used in this study were ERR048305 for PG1, ERR048281 for PG2, AE014073 for PG3, ERR127046 for PG4, ERR832464 for PG5, ERR0848288 for PG6 and ERR048317 for PG7 [5,9,10].

Molecular typing and data analyses
The isolates were subjected to WGS. WGS data were obtained using the MiSeq platform (Illumina). Library preparation, genome assembly and single-nucleotide variant (SNV) extraction were performed as described previously [11]. Genomic libraries were prepared by the Nextera XT DNA Library Preparation Kit (Illumina) and sequenced on the MiSeq sequencer (Illumina). Genomic assembly was performed using SPAdes version 3.13.0 (RRID:SCR_000131) with the -careful option and a read coverage cutoff value of 10 [12]. The assembled contigs were incorporated into the BioNumerics software version 8.1 (Applied Maths) for core genome multilocus sequence typing (cgMLST) analysis and for identifying the antimicrobial resistance genes. cgMLST of BioNumerics includes 2513 loci extracted with the Core (Enterobase) setting from 14 837 whole genome MLST (wgMLST) loci (https://www.bionumerics.com/sites/default/files/extra/Release-Note-Eschericha-coli-Shigella-schema.pdf). A minimum similarity of 77.5 % was used for allele calls. Identification of antimicrobial resistance genes was performed using the Escherichia coli functional genotyping plug-in of the BioNumerics software with default settings (minimum identity of 95.0 % and minimum coverage of 95.0 % for blast; https://www.bionumerics.com/applications/functional-genotyping). A bio-neighbour joining tree was created by the BioNumerics software with the default settings (using the character data of cgMLST alleles as categorical, Merge taxa when distance is zero and No resampling). Molecular serotyping was performed using ShigaTyper version 2.0.2 and in silico PCR screening [13,14]. SNVs were identified using Snippy version 4.3.6 (https://github.com/tseemann/snippy) using the assembled sequences. S. flexneri strain 2457 T (accession no. AE014073), which is of serotype 2a and belongs to PG3, was used as a reference. SNVs in recombinogenic, repeat and prophage regions identified by Gubbins version 2.3.1 [15], NUCmer version 4.0.0rc1 [16] and phaster [17] were excluded. A maximum-likelihood phylogenetic tree was generated using RaxML version 8.2.12 [18] with -m GTRGAMMA option and 1000 bootstrap iterations. Sequence comparisons were performed using blastn and visualized using Easyfig [19]. Visualization of overall comparison of whole genome sequences were done using brig [20].

Sequence data
Nucleotide sequence data have been submitted to the DNA Data Bank of Japan (DDBJ) Sequence Read Archive under accession numbers DRR298131-DRR298147. The data have been deposited with links to DDBJ BioProject accession number PRJDB11754 in DDBJ. The assembled sequence data used in this study have been also deposited to DDBJ. Accession numbers are shown in Table 1.

Clusters found in WGS analysis
WGS analysis created clusters of isolates according to the spatiotemporal backgrounds (Fig. 1). The number of SNVs within each incident was within six, except for incident G of Cao Bang in 2016. One cluster comprised incident B of Son La in 2012. Another cluster comprised incident C of Dien Bien in 2014. The third comprised incidents E and F of Yen Bai in 2015 and 2016, respectively. Although the isolates of Yen Bai in 2016 formed a small branch in SNV analysis (Fig. 1), the number of SNVs was within six. This suggests that the two incidents might have been caused by a single source of contamination.
Two isolates from incident G of Cao Bang in 2016 were assigned next to each other in WGS analyses, although the number of variations was higher than in other clusters (31 SNVs).

Phylogroup, molecular serotyping and antimicrobial resistance genes
The isolates in this study were compared to representative strains of the phylogroups [6]. This suggested that all isolates in this study belonged to PG3 (Fig. 2). PG3 is the dominant phylogroup. The isolates could be divided into two sub-lineages in SNV analyses, named Sub-1 and Sub-2 ( Fig. 1). Based on molecular serotyping, 13 and four of 17 isolates were serotypes 1a and 2a, respectively. The distribution of serotypes of the 17 isolates in this study completely matched the two sublineages, and serotype 1a and 2a isolates belonged to Sub-1 and Sub-2, respectively. According to a previous study, PG3 includes multiple serotypes but serotype 2a is predominant followed by serotype 2b. Serotype 1a is the minor serotype and distributed in PG1 and PG3. PG3 with serotype 1a is quite minor, accounting for only 3 % of PG3 [5]. Data from three PG3 serotype 1a isolates from a previous study [5] were incorporated into the analyses. The three isolates, two from Vietnam (H04 and H10) and one from China (IB0034), were assigned to Sub-1 and Sub-2, respectively. Some antimicrobial resistance genes were also distributed specifically in the sub-lineages (Table 2). For example, dfrA1 was identified only in Sub-2 while dfrA14 was identified in Sub-1. All Sub-1 had the GyrA-S83L mutation and bla TEM-1B . In Sub-2, only one isolate in this study, P12012, had the GyrA-S83L mutation. No other types of gyrA mutation were found in this study. A recent  White nodes indicate the PG3 serotype 1a isolates of a previous study [5]. Branch length was calculated based on allele differences.   study identified two sub-lineages in PG3, Lin-3.1 and Lin-3.2 [8]. The resistance gene for trimethoprim is one of the key features for each Lin; Lin-3.2 was associated with dfrA1 while Lin-3.1 independently acquired a plasmid-borne dfrA14.
Another feature discriminating between Lin-3.1 and Lin-3.2 is quinolone resistance. Quinolone resistance emerged in the mid-1990s both in Lin-3.1 and in Lin-3.2, and became prevalent in the mid-2010s in Lin-3.1 but not in Lin-3.2 [8]. These features corresponded to those of the sub-lineages found in this study. Distributions of dfrA14 and quinolone resistance point mutations were significantly observed in Sub-1. Additionally, dfrA1 was only found in Sub-2. Thus, it was likely that Sub-1 and Sub-2 corresponded to Lin-3.1 and Lin-3.2.

Genetic background of serotype 1a isolates
The O-antigen modification gene responsible for serotype 1a is gtrI, which encodes a glycosyl transferase [21]. A 43 kb contig sequence (BSBY01000027) encompassing gtrI was identified in P15021. Contig BSBY01000027 was subjected to a blastn search. The top hit sequence was CP012735 of strain 0228 originating from China (coverage 100 % and identity 99.98 %), but not the original sequence of a gtrI-containing genetic element, AF139596 (Fig. 3). The matched region of CP012735 contained sequences of gtrI, integrase and putative bacteriophage components. Overall comparison between the whole genome sequences indicated that the gtrI-containing contig of P15021 corresponded to phage region 18 of strain 0228, which was lacking in the PG3 serotype 2a reference strain 2457 T (Fig. 4). Strain 0228 was assigned to Sub-2 ( Fig. 2). Thus, all serotype 1a isolates from Vietnam were assigned to Sub-1, whereas those from China (IB0034 and 0228) were assigned to Sub-2. A serotype 2a isolate, P12012, was assigned in the same branch as strains IB0034 and 0228. This branch is intriguing because multiple serotypes 1a and 2a were included and the phage region 18 responsible for serotype conversion was not conserved in strain IB0034 (Fig. 4). This suggests multiple ways to acquire serotype conversion.
Seroconversion is a major strategy for survival by evading the immune response of the host. Notably, S. flexneri 1a isolates of Vietnam were specifically assigned to Sub-1, including those of another study (H04 and H10) [5]. All the representative isolates of serotype 1a isolates in this study and the previous study had the common phage region 18 of strain 0228 identified by phaster [17] (Fig. 4). PG3 predominantly consists of serotype 2a. Serotype 1a isolates in this study were considered to be a quite minor part of PG3. It is difficult to conclude where and when acquisition of the genetic element for serotype 1a occurred because the number of isolates analysed was limited. Nevertheless, this study indicates that Vietnamese isolates of serotype 1a created a specific sub-lineage where the putative seroconversion phage was conserved, suggesting its circulation in northern Vietnam. Further surveillance would be needed to uncover the distribution of S. flexneri in Vietnam.
In conclusion, S. flexneri infection remains a public concern in northern Vietnam. Clusters of S. flexneri were observed based on WGS analysis. All isolates used in this study belonged to PG3 and consisted of two sub-lineages where serotypes 1a and 2a were distributed specifically.

Five reasons to publish your next article with a Microbiology Society journal
1. When you submit to our journals, you are supporting Society activities for your community.
2. Experience a fair, transparent process and critical, constructive review.
3. If you are at a Publish and Read institution, you'll enjoy the benefits of Open Access across our journal portfolio. 4. Author feedback says our Editors are 'thorough and fair' and 'patient and caring'. 5. Increase your reach and impact and share your research more widely.
Find out more and submit your article at microbiologyresearch.org.

Comments:
The work presented is clear and the arguments well formed. This is a study that would be of interest to the field and community.

Reviewers' comments and responses to custom questions:
Please rate the manuscript for methodological rigour

Reviewer 2: Good
Please rate the quality of the presentation and structure of the manuscript There's still information missing that was previously requested and would be required for others to repeat this work: Line 85. I appreciate the extra information included, but what were the cut-offs for defining the genes to be core, what percentage identity and coverage is required? Bionumerics is not a freely available software, therefore it is not easy for a reader to access that information.
> Thank you for your comment. As you pointed out, Bionumerics is a commercial software and available information is limited. We have included as much as information on cgMLST of Bionumerics that can be obtained in Bionumerics and the URL sites of the software.
Line 86: As above, what are the default settings? This software is not freely available and therefore this is not easy to access information for the reader.
> Thank you for your comment. As above, we have included as much as information on E. coli functional genotyping of Bionumerics that can be obtained in Bionumerics and the URL sites of the software. > Thank you for your comment. Sorry for not mentioning the model used. We used GTRGAMMA and have revised the manuscript.
Line 101. Thank you for uploading the genome assemblies.

Presentation of results
The figure legends ought to be expanded, currently they do not fully describe what is presented in the figures and lack detail that would be beneficial for the reader to interpret the figures.    >No specific parameters are available for creating BNJ tree in Bionumerics. We are sorry for that.    The comparison region has been altered from the previous version, omitting the 5' region of the CP012735 sequence (region previously spanned 4650000-4654000 bp, now it is 4659000-4699000 bp. PHAST analysis seems to reveal the phage region from CP012735 to be the region 4646913-4698561 bp (which is represented in Figure 4, and I feel these should be consistent).
> Thank you for your comment. We have amended Figure 3 and legend.  3. How the style and organization of the paper communicates and represents key findings The organisation and layout of the findings is fine.

Literature analysis or discussion
Thank you for expanding on the findings and adding further context and discussion.
Line 55: Please expand on this sentence "Phylogenetically, S. flexneri encompasses seven phylogroups (PGs), excluding S. flexneri serotype 6.", to explain how the phylogroups were determined by Connor et al, and including this reference (used already in your manuscript as number 5). This is important as serotypes aren't specific to a PG (as you see with your work and particularly due to the large number of serotypes present in PG3) and the definition of a phylogroup is not dependent on that. I think it is important for the reader to understand that, so adding this additional information would be helpful. > Thank you for your comment. We have amended the sentences.
Line 141 please change sequence027 to the accession number: BSBY01000027. > Thank you for your comment. We have amended the sentence.
Line 142 The following "The contig sequence027 (Accession number, BSBY01000027" should be changed to "The contig BSBY01000027" due to the previous change requested.
> Thank you for your comment. We have amended the sentence.
Line 155 to 164: Thank you for expanding on these findings. I feel my previous comment was misunderstood (apologies). I was not referring to a transposon, but transposase genes associated with the phage element in 0228, which presumably would be required for movement of the phage out of and into the genome. Those genes appeared to be lost in isolate P15021. However, examining the BRIG figure it appears those genes may be present (given homology is detected) but most likely they are located on a different contig. Therefore, my previous comment is not relevant. Thank you for doing this extra analysis.

Any other relevant comments
The additional information added to the manuscript is much appreciated, particularly the additional information and discussion in the "Results and Discussion" section.
Comments: 1. Methodological rigour, reproducibility and availability of underlying data There's still information missing that was previously requested and would be required for others to repeat this work: Line 85. I appreciate the extra information included, but what were the cut-offs for defining the genes to be core, what percentage identity and coverage is required? Bionumerics is not a freely available software, therefore it is not easy for a reader to access that information. Line 86: As above, what are the default settings? This software is not freely available and therefore this is not easy to access information for the reader.  Figure 1: Still no mention of models used to produce the tree. Figure 1: I think for clarity it would be good to have a key for serotype 1a/2a (red/black) and for dfrA1/A14 (black/red). This information is in the figure legend but it would make it clearer to have it on the actual figure and red text for 1a and A14 would be unrequired (it makes it hard to read). Figure 1: Error in the figure assigning isolate 0228 to the Lao Cai region, which ought to be for isolate P12012. Figure 2: Parameters for the BioNJ tree still not included. Figure 3: Specific contig information is BSBY01000027, not sequence027. As reference is made to BSBY01000027 in the main text (as it should be), please change sequence027 to the accession number. Figure 3: Information that was previously in the figure legend has been removed, that ought to have remained. It would be beneficial for the reader to have the following information in the figure legend, to link to the figure: Compared sequences were obtained from, S. flexneri 0228 (accession number: CP012735), S. flexneri P15021 (accession number: BSBY01000027), S. flexneri Y53 (accession number: AF139596). Figure 3: Please include information that allows readers to know that the bands connecting the sequences represents nucleotide identity, and link to the key present on the figure. Figure 3: The comparison region has been altered from the previous version, omitting the 5' region of the CP012735 sequence (region previously spanned 4650000-4654000 bp, now it is 4659000-4699000 bp. PHAST analysis seems to reveal the phage region from CP012735 to be the region 4646913-4698561 bp (which is represented in Figure 4, and I feel these should be consistent). Figure 4: The figure resolution is rather low, which makes it difficult to read. Figure 4: Figure  legend needs to be expanded to describe what the BRIG image represents e.g. that the rings represent BLAST nucleotide identity (presumably?), what sequences are presented within the rings, GC skew and GC% etc. 3. How the style and organization of the paper communicates and represents key findings The organisation and layout of the findings is fine. 4. Literature analysis or discussion Thank you for expanding on the findings and adding further context and discussion. Line 55: Please expand on this sentence "Phylogenetically, S. flexneri encompasses seven phylogroups (PGs), excluding S. flexneri serotype 6.", to explain how the phylogroups were determined by Connor et al, and including this reference (used already in your manuscript as number 5). This is important as serotypes aren't specific to a PG (as you see with your work and particularly due to the large number of serotypes present in PG3) and the definition of a phylogroup is not dependent on that. I think it is important for the reader to understand that, so adding this additional information would be helpful. Line 141 please change sequence027 to the accession number: BSBY01000027. Line 142 The following "The contig sequence027 (Accession number, BSBY01000027" should be changed to "The contig BSBY01000027" due to the previous change requested. Line 155 to 164: Thank you for expanding on these findings. I feel my previous comment was misunderstood (apologies). I was not referring to a transposon, but transposase genes associated with the phage element in 0228, which presumably would be required for movement of the phage out of and into the genome. Those genes appeared to be lost in isolate P15021. However, examining the BRIG figure it appears those genes may be present (given homology is detected) but most likely they are located on a different contig. Therefore, my previous comment is not relevant. Thank you for doing this extra analysis. 5. Any other relevant comments The additional information added to the manuscript is much appreciated, particularly the additional information and discussion in the "Results and Discussion" section.

Please rate the quality of the presentation and structure of the manuscript Good
To what extent are the conclusions supported by the data? Strongly support

If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines? Yes
Author response to reviewers to Version 1 Comments 1. In Abstract, "The results suggested that two incidents in Yen Bai province in 2015 and 2016 were attributable to a single source" The isolates from two incidents are genetically closely related, indicating that they are derived from a very recent common ancestor but they are not necessarily come from a common source. They may likely be attributable to a common source only. > Thank you for your comment. We have amended the sentence. (Line 35) 2. Lines 54-55, "Phylogenetically, S. flexneri encompasses seven phylogroups (PGs), excluding S. flexneri serotype 6, S. flexneri serotype 2a of PG3 is predominant globally" The authors should briefly describe the 7 phylogroups of S. flexneri in the Introduction or Discussion, and provides more details regarding PG3. > Thank you for your comment. We have amended the sentence. (Lines 53-54) 3. Line 65 "3.1 Bacterial isolates" All isolates (genomes) used as references for the phylogenetic studies should be described in Methods. > Thank you for your comment. We have added the information. (Lines 69-71) 4. Table 1.
The reference strains used in the phylogenetic analysis should be included in Table 1, with relevant information (serotype, PGs, etc.). The accession numbers for the genomic sequences of the reference strains can be added to this Table. > Thank you for your comment. We have amended Table 1. 5. Table 2.
The information on incidents and sublineages should be added since the information has been mentioned in the Results and Discussion.
> Thank you for your comment. We have amended Table 2. 6. Figure 1   > Thank you for your comment. We have amended Figure 1 in the current version. Reference strain was 2457T, not 301. We have corrected it. Sorry for the confusion.
7. Figure 2 More information, e.g. year of isolation and incident, can be added in Figure 2. Serotypes 1a/2a, as well as dfrA1/dfrA14, can be listed on the same column in different colors. FEF should be replaced by the strain name (S. flexneri 301 or Sf301) as other 4 references (0228, IB0034, H10, H04) are indicated by strain names.
> Thank you for your comment. We have amended Figure 1 in the current version.
8. Lines 236-237, the description "Boxes indicate serotype (1a/2a), dfr genes (A1/A17), and provinces. The color indicative of the province is the same as in Fig. 1. The location of the province is shown on the right" The figure legend is not necessary. > Thank you for your comment. We have removed the figure legend. 9. Figure 3 The PGs (phylogroup) and strain names of the reference strains should be indicated. The characteristics of the reference strains should be described in Table 1 and the accession numbers should be given in Methods or Table 1. If Figure 3 was generated using BioNmerics, the authors can generate the figure in "Metafile" format that can be "degrouped" using Powerpoint to allow the objects (lines, circles, texts) to be modified or revised. Lines 242-243, the description "Run accession numbers of representatives of phylogroups are shown along with nodes. Red nodes indicate the isolates in this study" needs to be revised. > Thank you for your comment. We have amended Figure 2 in the current version.
10. The methodology is appropriate but requires additional information in order for other researchers to be able to reproduce this work: No mention of settings for BioNJ tree or MST tree (what algorithm was used?). This information needs to be included.
Line 79 -no mention of specific settings for cgMLST: How many alleles were in the core genome MLST scheme? What were the parameters used for defining "core genome"?
Line 80 -What were the settings for AMR analysis (minimum percentage for coverage and identity)?
Line 82 -Were assemblies used for the Snippy analysis or were the reads used? This information should be included.
Line 82 -Would be useful to mention the S. flexneri strain 301 is Serotype 2a, as it is relevant to the work undertaken.
Line 85 -What substitution model was used and what model was used for calculating the rate of heterogeneity?
Line 90 -With regard to data availability, as assemblies were produced and used in the analysis, they should be deposited in DDBJ as well as the reads. > Thank you for your comments. We have amended the sentences (Lines 78-93). Reference strain was 2457T, not 301. We have corrected it. Sorry for the confusion. The assemblies were deposited to DDBJ as well (Lines 97-98, and Table 1)           > Thank you for your comments. We have amended Figure 3 in the current version.

How the style and organization of the paper communicates and represents key findings
The organisation and layout of the findings is fine.

Literature analysis or discussion
There is a lot of discussion that could be included to go along with the findings. As well as expanding on the findings discussed: Line 116-118 -Might be good to mention that the only sub-2 isolate to have gyrA mutations lays within its own clade compared to the other sub-2 isolates? This is interesting and worth discussing.
Line 118 -The authors mention "It was likely that Sub-1 and Sub-2 would correspond to Lin-3.1 and Lin-3.2 of PG3 in a recent study, respectively" -Please expand on this to explain reasoning or link back to previous observations made to explain reasoning. This is another finding that would be interesting to expand on and put into context.
Line 121 -Do all the serotype 1a isolates have this region containing gtrI? If so, is it the same as the region found within P15021? If it isn't the same, how is it different?
Line 122 -What is the definition of "most matched"? Coverage and identity information should be included.
Line 123 -It is not significant that there is a higher degree of sequence coverage and identity (if that is what is meant but "most matched") between the CP012735 and P15021 compared to just the gtrI region, because the gtrI region only contains the immediate genes. It is not a "like for like" comparison. It would be more relevant if you compared on the gtrI region of CP012735, P15021 and the original gtrI sequence. However, overall comparison between CP012735 and P15021 is important on its own.
Line 124-126 -This should be expanded upon. What you have found is that the gene responsible for serotype 1a assignment appears to be carried on a mobile element, which is why the isolates from China group in Sub-2, though they are serotype 1a (which you would expect to group in Sub-1).
> Thank you for your comments. We have amended the sentences (Lines 133-146).
The authors go on to mention seroconversion being a strategy for survival and it appears the isolates from China have undergone seroconversion (as they group phylogenetically with serotype 2a isolates). Though there is no evidence presented here that this has occurred for the North Vietnam isolates. An additional reason to expand on your findings as it is an interesting discussion point.
> Thank you for your comments. We have amended the sentences (Lines 145-155).
Line 127-130 -No mention of loss of transposase genes in the genomic island within the North Vietnam isolates of serotype 1a, suggesting they are unable to undergo seroconversion. This would be interesting to expand upon, particularly with the findings regarding the isolates from China that are serotype 1a and group with the serotype 2a isolates from this study found in North Vietnam.
> Thank you for your comments. Transposon is one of the mechanisms of mobile genetic elements, but this is not the only mechanism. Phage-mediated transfer of genes could be responsible for seroconversion as well. The phage element was found to be conserved among the isolates in this study ( Figure 4 in the current version) and we consider the phage element would be more responsible than transposon to transfer such a large genetic element. So, we didn't mention about transposase. We have amended the sentences (Lines 145-155) and we hope you understand it. Overall this is an interesting paper, characterising Shigella flexneri strains from Northern Vietnam. I feel the paper would benefit from including a discussion about the findings presented, rather than just presenting the findings. There are a lot of interesting results here that lack context or expansion, therefore are not highlighted as well as they could be. tree or MST tree (what algorithm was used?). This information needs to be included. Line 79 -no mention of specific settings for cgMLST: How many alleles were in the core genome MLST scheme? What were the parameters used for defining "core genome"? Line 80 -What were the settings for AMR analysis (minimum percentage for coverage and identity)? Line 82 -Were assemblies used for the Snippy analysis or were the reads used? This information should be included. Line 82 -Would be useful to mention the S. flexneri strain 301 is Serotype 2a, as it is relevant to the work undertaken. Line 85 -What substitution model was used and what model was used for calculating the rate of heterogeneity? Line 90 -With regard to data availability, as assemblies were produced and used in the analysis, they should be deposited in DDBJ as well as the reads. 2. Presentation of results Table 2: Adding either serotype information to this table and/or sub-lineage information would be helpful, particularly with regard to observations made. Figure 1: Would be helpful to have the information about the incident included. Could replace province abbreviations with incident information, as the colour coding already denotes the province. Figure 1: Additional information regarding settings for MST tree would be useful. Figure 1: What are the branch lengths? Allele differences? Figure 2: Bootstrap scores are missing from the tree. Figure 2: Is the tree unrooted? Figure 2: Information about the models used to produce the tree should be included in the figure legend (as with Figure 1). Figure 2: Typo: dfr genes (A1/A14) not (A1/A17) Figure 3: Annotation on the tree denotating each Phylogroup would be helpful, to allow for comparison. Figure 3: Again, what parameters were used for the BioNJ tree? Figure 3: What are the branch lengths? Allele differences? Figure 3 Figure 4: Needs to be annotated properly. Information should be added to the figure about which sequence belongs to which source (name only, spanning region not required and can be left in the legend), rather than just in the figure legend. The specific contig information for isolate P15021 should be included. 3. How the style and organization of the paper communicates and represents key findings The organisation and layout of the findings is fine. 4. Literature analysis or discussion There is a lot of discussion that could be included to go along with the findings. As well as expanding on the findings discussed: Line 116-118 -Might be good to mention that the only sub-2 isolate to have gyrA mutations lays within its own clade compared to the other sub-2 isolates? This is interesting and worth discussing. Line 118 -The authors mention "It was likely that Sub-1 and Sub-2 would correspond to Lin-3.1 and Lin-3.2 of PG3 in a recent study, respectively" -Please expand on this to explain reasoning or link back to previous observations made to explain reasoning. This is another finding that would be interesting to expand on and put into context. Line 121 -Do all the serotype 1a isolates have this region containing gtrI? If so, is it the same as the region found within P15021? If it isn't the same, how is it different? Line 122 -What is the definition of "most matched"? Coverage and identity information should be included. Line 123 -It is not significant that there is a higher degree of sequence coverage and identity (if that is what is meant but "most matched") between the CP012735 and P15021 compared to just the gtrI region, because the gtrI region only contains the immediate genes. It is not a "like for like" comparison. It would be more relevant if you compared on the gtrI region of CP012735, P15021 and the original gtrI sequence. However, overall comparison between CP012735 and P15021 is important on its own. Line 124-126 -This should be expanded upon. What you have found is that the gene responsible for serotype 1a assignment appears to be carried on a mobile element, which is why the isolates from China group in Sub-2, though they are serotype 1a (which you would expect to group in Sub-1). The authors go on to mention seroconversion being a strategy for survival and it appears the isolates from China have undergone seroconversion (as they group phylogenetically with serotype 2a isolates). Though there is no evidence presented here that this has occurred for the North Vietnam isolates. An additional reason to expand on your findings as it is an interesting discussion point. Line 127-130 -No mention of loss of transposase genes in the genomic island within the North Vietnam isolates of serotype 1a, suggesting they are unable to undergo seroconversion. This would be interesting to expand upon, particularly with the findings regarding the isolates from China that are serotype 1a and group with the serotype 2a isolates from this study found in North Vietnam. 5. Any other relevant comments General typos: Line 46 -Gram instead of gram Line 121 -Kbp instead of Kb (particularly because the authors use Kbp in figure 4) Line 122 -nBLAST instead of BLAST Overall this is an interesting paper, characterising Shigella flexneri strains from Northern Vietnam. I feel the paper would benefit from including a discussion about the findings presented, rather than just presenting the findings. There are a lot of interesting results here that lack context or expansion, therefore are not highlighted as well as they could be.

Please rate the manuscript for methodological rigour Satisfactory
Please rate the quality of the presentation and structure of the manuscript Satisfactory Comments: The authors characterize 17 Shigella flexneri isolates from 8 incident events that occurred in 5 provinces in Northern Vietnam. All isolates belong to phylogroup 3 and are distributed in 2 sublineages, Sub-1 and Sub-2. The manuscript is concise; however, the tables and Figures need to be revised or modified. Comments 1. In Abstract, "The results suggested that two incidents in Yen Bai province in 2015 and 2016 were attributable to a single source" The isolates from two incidents are genetically closely related, indicating that they are derived from a very recent common ancestor but they are not necessarily come from a common source. They may likely be attributable to a common source only. 2. Lines 54-55, "Phylogenetically, S. flexneri encompasses seven phylogroups (PGs), excluding S. flexneri serotype 6, S. flexneri serotype 2a of PG3 is predominant globally" The authors should briefly describe the 7 phylogroups of S. flexneri in the Introduction or Discussion, and provides more details regarding PG3. 3. Line 65 "3.1 Bacterial isolates" All isolates (genomes) used as references for the phylogenetic studies should be described in Methods. 4. Table 1. The reference strains used in the phylogenetic analysis should be included in Table 1, with relevant information (serotype, PGs, etc.). The accession numbers for the genomic sequences of the reference strains can be added to this Table. 5. Table 2. The information on incidents and sublineages should be added since the information has been mentioned in the Results and Discussion. 6. Figure 1 and Figure 2 Figure 1 and Figure 2 give phylogenetic relationships among the same panel of isolates. The phylogenetic structures constructed using the two methods are highly similar. Figure 2 contains more meaningful data. Figure 1 should be removed. 7. Lines 109-111, "The distribution of serotypes completely matched the two sub-lineages, and serotype 1a and 2a isolates belonged to Sub-1 and Sub-2, respectively" However, in Figure 2, strain 0228 is serotype 1a but belongs to Sub-2. The description should be revised. 8. Lines 228-231, "White circles show strains of reference ("REF") and the previous study [5]. The color indicates the province: green, Yen Bai; red, Son La; purple, Dien Bien; yellow, Cao Bang; and light blue, Lao Cai. Branch lengths are shown on the branches. Abbreviations of the province and isolation year are indicated on each node" White circles indicate all reference strains. What is REF? S. flexneri 301? The circles in the figure should be labeled with the isolate/strain names to keep consistency. The provinces of the isolates recovered have been indicated in colors illustrated on the figure legend, it is not necessary to state in the legend of the figure. References (indicated in white color) should be indicated in the figure legend. Since the figure generated using the tools provided in BioNumerics, it can be saved in "Metafile" format and can be edited using the Microsoft Powerpoint. 7. Figure 2 More information, e.g. year of isolation and incident, can be added in Figure 2. Serotypes 1a/2a, as well as dfrA1/dfrA14, can be listed on the same column in different colors. FEF should be replaced by the strain name (S. flexneri 301 or Sf301) as other 4 references (0228, IB0034, H10, H04) are indicated by strain names. 8. Lines 236-237, the description "Boxes indicate serotype (1a/2a), dfr genes (A1/A17), and provinces. The color indicative of the province is the same as in Fig. 1. The location of the province is shown on the right" The figure legend is not necessary. 9. Figure  3 The PGs (phylogroup) and strain names of the reference strains should be indicated. The characteristics of the reference strains should be described in Table 1 and the accession numbers should be given in Methods or Table 1. If Figure 3 was generated using BioNmerics, the authors can generate the figure in "Metafile" format that can be "degrouped" using Powerpoint to allow the objects (lines, circles, texts) to be modified or revised. Lines 242-243, the description "Run accession numbers of representatives of phylogroups are shown along with nodes. Red nodes indicate the isolates in this study" needs to be revised. 10. Please rate the manuscript for methodological rigour Satisfactory